Abundance estimation and differential testing on strain level in metagenomics data
Abstract
Motivation: Current metagenomics approaches allow the composition of microbial communities to be analyzed at high resolution. Important compositional changes are known to occur even at strain level and to go hand in hand with changes in disease or ecological state. However, strain-level analysis poses specific challenges because the genome sequences involved are highly similar. Only a limited number of tools address taxa abundance estimation beyond species level, and there is a strong need for dedicated tools for strain resolution and differential abundance testing.

Methods: We present DiTASiC (Differential Taxa Abundance including Similarity Correction) as a novel approach for quantification and differential assessment of individual taxa in metagenomics samples. We introduce a generalized linear model to resolve shared read counts, which cause a significant bias at strain level. Further, we capture abundance estimation uncertainties, which play a crucial role in differential abundance analysis. We build a novel statistical framework that integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level.

Results: We obtain highly accurate abundance estimates down to sub-strain level and enable fine-grained resolution of strain clusters. We demonstrate the relevance of read ambiguity resolution and of integrating abundance uncertainties for differential analysis. Even small changes are detected accurately, and false positives are significantly reduced. Superior performance is shown on the latest benchmark sets of various complexities and in comparison to existing methods.

Availability and Implementation: DiTASiC code is freely available from https://rki_bioinformatics.gitlab.io/ditasic.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
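The idea of resolving shared read counts can be illustrated with a small sketch. It is not the DiTASiC implementation. It assumes a Poisson model with identity link in which the reads observed on each reference are a similarity-weighted sum of the true per-taxon counts, and it fits that model with a simple expectation-maximization update instead of a GLM library. The function name `correct_shared_reads`, the similarity matrix, and the read counts are hypothetical and chosen only for illustration.

```python
def correct_shared_reads(similarity, observed, max_iter=20000, tol=1e-10):
    """Estimate non-negative per-taxon read counts r such that
    observed[i] ~ Poisson(sum_j similarity[i][j] * r[j]).

    similarity[i][j]: assumed fraction of reads from taxon j that map to reference i.
    observed[i]: number of reads mapped to reference i (ambiguous reads counted on each hit).
    """
    n_refs = len(observed)
    n_taxa = len(similarity[0])
    col_sums = [sum(similarity[i][j] for i in range(n_refs)) for j in range(n_taxa)]

    # Start from a flat, strictly positive guess.
    est = [sum(observed) / n_taxa] * n_taxa

    for _ in range(max_iter):
        # Counts the current estimate would produce on each reference.
        expected = [sum(similarity[i][j] * est[j] for j in range(n_taxa))
                    for i in range(n_refs)]
        ratios = [observed[i] / expected[i] if expected[i] > 0 else 0.0
                  for i in range(n_refs)]
        # Multiplicative EM update; keeps every estimate non-negative.
        updated = [est[j] / col_sums[j]
                   * sum(similarity[i][j] * ratios[i] for i in range(n_refs))
                   for j in range(n_taxa)]
        change = max(abs(u - e) for u, e in zip(updated, est))
        est = updated
        if change < tol:
            break
    return est


# Hypothetical example: two strains whose reads cross-map 80% of the time.
similarity = [[1.0, 0.8],
              [0.8, 1.0]]
observed = [1160.0, 1000.0]  # raw mapped counts look almost equal
estimates = correct_shared_reads(similarity, observed)
print([round(x, 1) for x in estimates])
```

In this toy case the raw mapped counts suggest two strains of similar abundance, while the corrected estimates separate a dominant strain from a minor one. A full treatment would also report the uncertainty of each estimate, which the abstract identifies as essential for differential testing.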
Related articles
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
MOTIVATION Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic ...
Bracken: estimating species abundance in metagenomics data
Metagenomic experiments attempt to characterize microbial communities using high-throughput DNA sequencing. Identification of the microorganisms in a sample provides information about the genetic profile, population structure, and role of microorganisms within an environment. Until recently, most metagenomics studies focused on high-level characterization at the level of phyla, or alternatively...
Pipasic: similarity and expression correction for strain-level identification and quantification in metaproteomics
MOTIVATION Metaproteomic analysis allows studying the interplay of organisms or functional groups and has become increasingly popular also for diagnostic purposes. However, difficulties arise owing to the high sequence similarity between related organisms. Further, the state of conservation of proteins between species can be correlated with their expression level, which can lead to significant ...
Robust statistical methods for differential abundance analysis of metagenomics data
This document outlines my 2011-2012 AMSC project for the 663/664 course series and in particular the mid-year progress. It is an ever evolving document. The project is to develop Metastats 2.0, a software package analyzing metagenomic data. We propose two major extensions and modifications to the Metastats software and the underlying statistical methods. The first extension of Metastats is a mi...
Metagenomic abundance estimation and diagnostic testing on species level
One goal of sequencing-based metagenomic community analysis is the quantitative taxonomic assessment of microbial community compositions. In particular, relative quantification of taxons is of high relevance for metagenomic diagnostics or microbial community comparison. However, the majority of existing approaches quantify at low resolution (e.g. at phylum level), rely on the existence of speci...